I would like to analyze customer preferences and financial behaviors to identify potential cross-selling opportunities. Ultimately, the prediction model can help the bank identify good clients who are likely to buy a credit card and flag risky clients who are unlikely to repay their credit.
# Load the libraries
library(tidymodels)
library(tidyverse)
library(dplyr)
library(DataExplorer)
library(ggplot2)
library(lubridate)
library(rlang)
library(gridExtra)
library(cowplot)
library(caret)
library(pROC)
library(glmnet)
library(Matrix)
library(ROSE)
library(randomForest)
library(ranger)
account <- read.csv(file = 'data/account.csv',sep = ';')
card <- read.csv(file = 'data/card.csv',sep = ';')
client <- read.csv(file = 'data/client.csv',sep = ';')
disp <- read.csv(file = 'data/disp.csv',sep = ';')
district <- read.csv(file = 'data/district.csv',sep = ';')
loan <- read.csv(file = 'data/loan.csv',sep = ';')
order <- read.csv(file = 'data/order.csv',sep = ';')
trans <- read.csv(file = 'data/trans.csv',sep = ';')
There are eight dataframes; I will examine them one by one to understand them.
# function print_df_info will print some basic information of a dataframe
print_df_info <- function(df) {
  cat(dim(df)[1], "rows and", dim(df)[2], "columns\n")
  cat("Data types:\n")
  print(sapply(df, class))
  cat("number of NAs in columns:\n")
  cat(" ", colSums(is.na(df)), "\n", sep = " ")
  cat(sum(rowSums(is.na(df)) > 0), "rows contain missing values\n")
  if (length(df) > 0) {
    for (col in names(df)) {
      if (is.numeric(df[[col]])) {
        cat("\nColumn", col, "\n")
        print(summary(df[[col]]))
      } else if (is.character(df[[col]])) {
        cat("\nColumn", col, "unique Values:\n")
        unique_values <- unique(df[[col]])
        cat(" ", paste(unique_values, collapse = ", "), "\n")
      }
    }
  } else {
    cat("No data in this dataframe.\n")
  }
}
print_df_info(disp)
5369 rows and 4 columns
Data types:
disp_id client_id account_id type
"integer" "integer" "integer" "character"
number of NAs in columns:
0 0 0 0
0 rows contain missing values
Column disp_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1418 2839 3337 4257 13690
Column client_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1418 2839 3359 4257 13998
Column account_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1178 2349 2767 3526 11382
Column type unique Values:
OWNER, DISPONENT
There are no NAs or blank cells. Data types are correct.
# function to check the relationship between two key columns
check_relationship <- function(dataframe, column1, column2) {
  if (all(table(dataframe[[column1]]) == 1) && all(table(dataframe[[column2]]) == 1)) {
    cat("The relationship between", column1, "and", column2, "is one-to-one.\n")
  } else if (any(table(dataframe[[column1]]) > 1) && all(table(dataframe[[column2]]) == 1)) {
    cat("The relationship between", column1, "and", column2, "is one-to-many.\n")
  } else if (all(table(dataframe[[column1]]) == 1) && any(table(dataframe[[column2]]) > 1)) {
    cat("The relationship between", column1, "and", column2, "is many-to-one.\n")
  } else {
    cat("The relationship between", column1, "and", column2, "is neither one-to-one, one-to-many, nor many-to-one.\n")
  }
}
check_relationship(disp, "account_id", "disp_id")
The relationship between account_id and disp_id is one-to-many.
The relationship between client_id and disp_id is one-to-one.
892 rows and 3 columns
Data types:
card_id disp_id type
"integer" "integer" "character"
number of NAs in columns:
0 0 0
0 rows contain missing values
Column card_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 229.8 456.5 480.9 684.2 1247.0
Column disp_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
9 1387 2938 3512 4460 13660
Column type unique Values:
classic, junior, gold
# remove junior credit card customers, because of the specific goals and strategies of the bank
card <- card %>% filter(type != 'junior')
# Transform 'issued' to a Date type
card$issued <- as.Date(card$issued, format = '%y%m%d')
# rename columns
colnames(card)[3:4] <- c("card_type","card_issued")
print(head(card,3))
The relationship between card_id and disp_id is one-to-one.
# add column names
colnames(district) <- c("district_id","district_name","region","nr_inhabitants","nr_municipalities_1","nr_municipalities_2","nr_municipalities_3","nr_municipalities_4","nr_cities","ratio_urban_inhabitants","average_salary", "unemploy_rate_95","unemploy_rate_96","nr_enterpreneurs_per_1000_inhabitants","nr_commited_crimes_95", "nr_commited_crimes_96")
print(head(district,3))
77 rows and 16 columns
Data types:
district_id district_name
"integer" "character"
region nr_inhabitants
"character" "integer"
nr_municipalities_1 nr_municipalities_2
"integer" "integer"
nr_municipalities_3 nr_municipalities_4
"integer" "integer"
nr_cities ratio_urban_inhabitants
"integer" "numeric"
average_salary unemploy_rate_95
"integer" "character"
unemploy_rate_96 nr_enterpreneurs_per_1000_inhabitants
"numeric" "integer"
nr_commited_crimes_95 nr_commited_crimes_96
"character" "integer"
number of NAs in columns:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 rows contain missing values
Column district_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 20 39 39 58 77
Column district_name unique Values:
Hl.m. Praha, Benesov, Beroun, Kladno, Kolin, Kutna Hora, Melnik, Mlada Boleslav, Nymburk, Praha - vychod, Praha - zapad, Pribram, Rakovnik, Ceske Budejovice, Cesky Krumlov, Jindrichuv Hradec, Pelhrimov, Pisek, Prachatice, Strakonice, Tabor, Domazlice, Cheb, Karlovy Vary, Klatovy, Plzen - mesto, Plzen - jih, Plzen - sever, Rokycany, Sokolov, Tachov, Ceska Lipa, Decin, Chomutov, Jablonec n. Nisou, Liberec, Litomerice, Louny, Most, Teplice, Usti nad Labem, Havlickuv Brod, Hradec Kralove, Chrudim, Jicin, Nachod, Pardubice, Rychnov nad Kneznou, Semily, Svitavy, Trutnov, Usti nad Orlici, Blansko, Brno - mesto, Brno - venkov, Breclav, Hodonin, Jihlava, Kromeriz, Prostejov, Trebic, Uherske Hradiste, Vyskov, Zlin, Znojmo, Zdar nad Sazavou, Bruntal, Frydek - Mistek, Jesenik, Karvina, Novy Jicin, Olomouc, Opava, Ostrava - mesto, Prerov, Sumperk, Vsetin
Column region unique Values:
Prague, central Bohemia, south Bohemia, west Bohemia, north Bohemia, east Bohemia, south Moravia, north Moravia
Column nr_inhabitants
Min. 1st Qu. Median Mean 3rd Qu. Max.
42821 85852 108871 133885 139012 1204953
Column nr_municipalities_1
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 22.00 49.00 48.62 71.00 151.00
Column nr_municipalities_2
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 16.00 25.00 24.32 32.00 70.00
Column nr_municipalities_3
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 4.000 6.000 6.273 8.000 20.000
Column nr_municipalities_4
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 1.727 2.000 5.000
Column nr_cities
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 5.00 6.00 6.26 8.00 11.00
Column ratio_urban_inhabitants
Min. 1st Qu. Median Mean 3rd Qu. Max.
33.90 51.90 59.80 63.04 73.50 100.00
Column average_salary
Min. 1st Qu. Median Mean 3rd Qu. Max.
8110 8512 8814 9032 9317 12541
Column unemploy_rate_95 unique Values:
0.29, 1.67, 1.95, 4.64, 3.85, 2.95, 2.26, 1.25, 3.39, 0.56, 0.45, 3.83, 2.77, 1.42, 3.13, 1.12, 2.38, 2.83, 2.65, 1.51, 1.10, 1.79, 1.39, 2.47, 2.64, 0.65, 1.62, 2.82, 3.38, 3.52, 2.80, 5.75, 6.43, 1.02, 3.33, 4.46, 7.08, 7.34, 6.49, 3.32, 2.41, 1.72, 2.79, 2.28, 1.78, 1.89, 4.83, 2.51, 2.52, 2.53, 1.60, 1.88, 4.69, 3.73, 3.24, 3.45, 4.76, 1.29, 3.79, 5.74, 3.51, 5.77, 4.09, ?, 6.63, 5.93, 3.80, 4.75, 5.38, 4.73, 4.01
Column unemploy_rate_96
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.430 2.310 3.600 3.787 4.790 9.400
Column nr_enterpreneurs_per_1000_inhabitants
Min. 1st Qu. Median Mean 3rd Qu. Max.
81.0 105.0 113.0 116.1 126.0 167.0
Column nr_commited_crimes_95 unique Values:
85677, 2159, 2824, 5244, 2616, 2640, 4289, 5179, 2987, 3810, 3475, 3804, 1597, 6604, 1845, 1874, 1003, 1740, 999, 1563, 2299, 1089, 2879, 5198, 1822, 6041, 1029, 1580, 818, 2985, 1328, 4340, 4650, 5323, 3384, 5796, 4147, 2653, 4947, 6949, 6445, 1658, 4085, 2166, 2080, 2854, 6079, 1655, 1660, 2123, 3496, 2564, 1850, 18721, 3659, 3729, 2212, 2595, 1879, 2112, 2719, 1562, 4484, 2157, 2247, 3244, 5623, ?, 9878, 4980, 9672, 4355, 18782, 4063, 3736, 3460
Column nr_commited_crimes_96
Min. 1st Qu. Median Mean 3rd Qu. Max.
888 2122 3040 5031 4595 99107
The data types of columns unemploy_rate_95 and nr_commited_crimes_95 are incorrect. Convert them from character to numeric. These two columns also contain the '?' character, which should be replaced with NA.
# replace "?" with NA
district$unemploy_rate_95[district$unemploy_rate_95 == "?"] <- NA
district$nr_commited_crimes_95[district$nr_commited_crimes_95 == "?"] <- NA
# transform data type to numeric
district$unemploy_rate_95 <- as.numeric(district$unemploy_rate_95)
district$nr_commited_crimes_95 <- as.numeric(district$nr_commited_crimes_95)
Now I have to deal with the NAs in these two columns. Based on my observation, the ratio between unemploy_rate_95 and unemploy_rate_96 may be a good basis for imputing the NAs in unemploy_rate_95. I use the same strategy for nr_commited_crimes_95.
First, calculate the ratios and check how they are distributed.
unemploy_ratio <- district$unemploy_rate_95 / district$unemploy_rate_96
crimes_ratio <- district$nr_commited_crimes_95 / district$nr_commited_crimes_96
p_unemploy <- ggplot(data.frame(unemploy_ratio = unemploy_ratio), aes(x = unemploy_ratio)) +
  geom_histogram(bins = 10, fill = "skyblue") +
  labs(
    x = "Ratio (unemploy_rate_95 / unemploy_rate_96)",
    y = "Frequency") + theme_minimal()
p_crime <- ggplot(data.frame(crimes_ratio = crimes_ratio), aes(x = crimes_ratio)) +
  geom_histogram(bins = 10, fill = "skyblue") +
  labs(
    x = "Ratio (nr_commited_crimes_95 / nr_commited_crimes_96)",
    y = "Frequency") + theme_minimal()
grid.arrange(p_unemploy, p_crime, ncol = 1)
Ratio of unemploy_rate: the distribution is slightly left skewed, although it is based on only 77 values. The ratios mostly fall between 0.7 and 0.9. I will use the median of the ratio to impute the NAs in the unemploy_rate_95 column.
Ratio of commited_crimes: the distribution is slightly right skewed, with values mostly between 0.9 and 1.1.
I will use the median ratios to impute the NAs.
# Impute NAs in unemploy_rate_95
district$unemploy_rate_95[is.na(district$unemploy_rate_95)] <- district$unemploy_rate_96[is.na(district$unemploy_rate_95)] * median(unemploy_ratio, na.rm = TRUE)
# Impute NAs in nr_commited_crimes_95
district$nr_commited_crimes_95[is.na(district$nr_commited_crimes_95)] <- district$nr_commited_crimes_96[is.na(district$nr_commited_crimes_95)] * median(crimes_ratio, na.rm = TRUE)
cat("Number of NAs in each column:", colSums(is.na(district)))
Number of NAs in each column: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5369 rows and 3 columns
Data types:
client_id birth_number district_id
"integer" "integer" "integer"
number of NAs in columns:
0 0 0
0 rows contain missing values
Column client_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1418 2839 3359 4257 13998
Column birth_number
Min. 1st Qu. Median Mean 3rd Qu. Max.
110820 406009 540829 535115 681013 875927
Column district_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 14.00 38.00 37.31 60.00 77.00
client <- client %>%
mutate(
mid_two_digits = as.numeric(substr(birth_number, 3, 4)),
gender = ifelse(mid_two_digits > 50, "female", "male"), # add gender
birth_number_ = ifelse(mid_two_digits > 50, birth_number - 5000, birth_number),
birthday = as.Date(paste0("19", birth_number_), format = "%Y%m%d")) # add birthday as date type
# check if the wrangling is correct
print(head(client, n = 5))
# it seems correct; now drop the irrelevant columns
client <- client %>% select(-c(mid_two_digits, birth_number, birth_number_))
head(client, n = 3)
The relationship between client_id and district_id is many-to-one.
4500 rows and 4 columns
Data types:
account_id district_id frequency date
"integer" "integer" "character" "integer"
number of NAs in columns:
0 0 0 0
0 rows contain missing values
Column account_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1183 2368 2786 3552 11382
Column district_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 13.00 38.00 37.31 60.00 77.00
Column frequency unique Values:
POPLATEK MESICNE, POPLATEK PO OBRATU, POPLATEK TYDNE
Column date
Min. 1st Qu. Median Mean 3rd Qu. Max.
930101 931227 960102 951655 961101 971229
# convert the 'date' column to Date format, rename
account$account_date <- ymd(account$date + 19000000)
account <- subset(account, select = -date)
colnames(account)[3] <- "account_freq"
print(head(account,3))
The relationship between account_id and district_id is many-to-one.
682 rows and 7 columns
Data types:
loan_id account_id date amount duration payments status
"integer" "integer" "integer" "integer" "integer" "numeric" "character"
number of NAs in columns:
0 0 0 0 0 0 0
0 rows contain missing values
Column loan_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
4959 5578 6176 6172 6752 7308
Column account_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 2967 5738 5824 8686 11362
Column date
Min. 1st Qu. Median Mean 3rd Qu. Max.
930705 950704 970206 963028 971212 981208
Column amount
Min. 1st Qu. Median Mean 3rd Qu. Max.
4980 66732 116928 151410 210654 590820
Column duration
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.00 24.00 36.00 36.49 48.00 60.00
Column payments
Min. 1st Qu. Median Mean 3rd Qu. Max.
304 2477 3934 4191 5814 9910
Column status unique Values:
B, A, C, D
The 'date' column should be transformed to Date type.
# convert the 'date' column to Date format, rename
loan$loan_date <- ymd(loan$date + 19000000)
loan <- subset(loan, select = -date)
colnames(loan)[3:6] <- c("loan_amount", "loan_duration", "loan_payments","loan_status")
print(head(loan,3))
The relationship between loan_id and account_id is one-to-one.
# drop duplicates, replace blank cells with NA
order <- unique(order)
order[order == ""] <- NA
print(head(order,3))
6471 rows and 6 columns
Data types:
order_id account_id bank_to account_to amount k_symbol
"integer" "integer" "character" "integer" "numeric" "character"
number of NAs in columns:
0 0 0 0 0 0
0 rows contain missing values
Column order_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
29401 31188 32988 33778 34786 46338
Column account_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1223 2433 2962 3646 11362
Column bank_to unique Values:
YZ, ST, QR, WX, CD, AB, UV, GH, IJ, KL, EF, MN, OP
Column account_to
Min. 1st Qu. Median Mean 3rd Qu. Max.
399 24159184 49756062 49399037 74000448 99994199
Column amount
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1242 2596 3281 4614 14882
Column k_symbol unique Values:
SIPO, UVER, , POJISTNE, LEASING
There are blank cells in column k_symbol, should be replaced with NA.
In this data, k_symbol and amount are two important features. k_symbol stands for the payment type:
- "POJISTNE" stands for insurance payment
- "SIPO" stands for household payment
- "LEASING" stands for leasing
- "UVER" stands for loan payment
So I will drop the columns which are probably irrelevant to the project: order_id, bank_to, account_to.
order$k_symbol[order$k_symbol == " "] <- NA
check_relationship(order, "account_id", "order_id") # check before dropping order_id
The relationship between account_id and order_id is one-to-many.
order <- order %>% select(-c(order_id, bank_to, account_to))
Convert to a wide table, so that there is only one observation per account_id.
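The reshaping code itself is not shown above; a minimal sketch of what it might look like, assuming the order amounts are summed per account and payment type (the name order_wide and the toy data are my own):

```r
library(dplyr)
library(tidyr)

# toy stand-in for the real `order` table (values are illustrative)
order_demo <- tibble(
  account_id = c(1, 1, 2),
  k_symbol   = c("SIPO", "UVER", "SIPO"),
  amount     = c(2000, 3000, 1500))

# one row per account_id, one amount column per payment type
order_wide <- order_demo %>%
  group_by(account_id, k_symbol) %>%
  summarise(amount = sum(amount), .groups = "drop") %>%
  pivot_wider(names_from = k_symbol, values_from = amount,
              names_prefix = "order_", values_fill = 0)
print(order_wide)
```

With `values_fill = 0`, accounts without a given payment type get 0 instead of NA, which keeps the wide table model-ready.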
1056320 rows and 10 columns
Data types:
trans_id account_id date type operation amount balance k_symbol bank
"integer" "integer" "integer" "character" "character" "numeric" "numeric" "character" "character"
account
"integer"
number of NAs in columns:
0 0 0 0 0 0 0 0 0 760931
760931 rows contain missing values
Column trans_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 430263 858506 1335311 2060979 3682987
Column account_id
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1204 2434 2937 3660 11382
Column date
Min. 1st Qu. Median Mean 3rd Qu. Max.
930101 960116 970410 965675 980228 981231
Column type unique Values:
PRIJEM, VYDAJ, VYBER
Column operation unique Values:
VKLAD, PREVOD Z UCTU, VYBER, , PREVOD NA UCET, VYBER KARTOU
Column amount
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 135.9 2100.0 5924.1 6800.0 87400.0
Column balance
Min. 1st Qu. Median Mean 3rd Qu. Max.
-41126 22402 33143 38518 49604 209637
Column k_symbol unique Values:
, DUCHOD, UROK, SIPO, SLUZBY, , POJISTNE, SANKC. UROK, UVER
Column bank unique Values:
, YZ, IJ, ST, UV, MN, OP, AB, CD, WX, GH, EF, QR, KL
Column account
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 17828584 45750951 45670919 72013407 99994199 760931
The following problems need to be solved:
- The data type of 'date' needs to be changed from integer to Date.
- There are many NAs.
- The 'type' column should contain only 'PRIJEM' or 'VYDAJ' according to the data reference; how should 'VYBER' be dealt with?
- The 'operation', 'k_symbol', 'bank', and 'account' columns have blank cells.
# 'date' column should be transformed to date type.
trans$trans_date <- ymd(trans$date + 19000000)
trans <- subset(trans, select = -date)
# replace "" and " " with NA
trans[trans == ""] <- NA
trans$k_symbol[trans$k_symbol == " "] <- NA
print(head(trans,3))
na_counts <- colSums(is.na(trans))
na_proportions <- na_counts / nrow(trans)
barplot(na_proportions, names.arg = names(na_proportions), col = "skyblue",
        main = "Proportion of NA Values by Column", ylab = "Proportion")
There are NAs in the columns operation (around 19%), k_symbol (around 50%), bank (around 78%), and account (around 76%). Now let's understand the meaning of these columns and how to deal with the missing data (impute or drop).
'operation' column: the mode of transaction. "VYBER KARTOU" credit card withdrawal; "VKLAD" credit in cash; "PREVOD Z UCTU" collection from another bank; "VYBER" withdrawal in cash; "PREVOD NA UCET" remittance to another bank.
'k_symbol' column: characterization of the transaction. "POJISTNE" insurance payment; "SLUZBY" payment for statement; "UROK" interest credited; "SANKC. UROK" sanction interest if negative balance; "SIPO" household payment; "DUCHOD" old-age pension; "UVER" loan payment.
‘bank’ column: bank of the partner.
‘account’ column: account of the partner.
It seems the contents of k_symbol, bank, and account depend on the operation type. Check how they are related.
# count the NA by operation type
trans %>% group_by(operation) %>%
summarise(
Total_Count = n(),
NA_count_k = sum(is.na(k_symbol)),
NA_count_bank = sum(is.na(bank)),
NA_count_account = sum(is.na(account))) %>% print()
From the table above we can see that:
- Although there are 183114 NAs in the operation column, in those 183114 rows the k_symbol values are all available, while the bank and account columns are all NA.
- The VYBER KARTOU operation has only 'account' information available.
- The PREVOD Z UCTU and PREVOD NA UCET operations have almost no NAs in the bank and account columns.
- The VKLAD operation has no available k_symbol, bank, or account information.
My understanding:
- The availability of the 'bank' and 'account' columns depends largely on the 'operation' type, which indicates that these two columns may not provide meaningful information across all operation types.
- In addition, over 75% of the values in these two columns are missing, so imputation would likely be unreliable.
- Therefore I will drop the 'bank' and 'account' columns.
How should I deal with the column k_symbol? First, check the relationship between operation and k_symbol.
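The counting code is not shown above; a sketch of the kind of cross-tabulation meant here, on a toy stand-in for `trans` (data and the name combos are my own):

```r
library(dplyr)

# toy stand-in for the relevant columns of the real `trans` table
trans_combo_demo <- tibble(
  operation = c("VYBER", "VKLAD", NA, "VYBER"),
  k_symbol  = c(NA, NA, "SIPO", "POJISTNE"))

# count every operation / k_symbol combination; NAs form their own groups
combos <- trans_combo_demo %>% count(operation, k_symbol, sort = TRUE)
print(combos)
```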
The columns 'operation' and 'k_symbol' have 14 unique combinations. All combinations seem reasonable and have large counts (except VYBER-POJISTNE). However, there is not enough information to impute the NAs in these two columns, so I will combine them into one variable (trans_detail) to resolve the NA issue.
trans <- trans %>% mutate(operation = ifelse(is.na(operation), ' ', operation), # replace NAs as blank before binding
k_symbol = ifelse(is.na(k_symbol), ' ', k_symbol),
trans_detail = paste(operation, k_symbol, sep = "-")) %>%
select(-operation, -k_symbol)
print(head(trans,3))
Now I will check the data that were incorrectly labeled as 'VYBER'.
All rows with 'type' equal to 'VYBER' have 'trans_detail' equal to 'VYBER-' (withdrawal in cash). All other rows with 'trans_detail' of 'VYBER-' have 'type' 'VYDAJ' (withdrawal), which makes complete sense. I will replace every incorrectly labeled 'VYBER' in the 'type' column with 'VYDAJ'.
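The relabeling step itself is not shown above; on the real data it could be as simple as the following sketch (toy data, names mine):

```r
# toy stand-in for the relevant columns of `trans`
trans_type_demo <- data.frame(
  type         = c("PRIJEM", "VYBER", "VYDAJ"),
  trans_detail = c("VKLAD- ", "VYBER- ", "VYBER- "))

# 'VYBER' in `type` is a mislabeled cash withdrawal; relabel it as 'VYDAJ'
trans_type_demo$type[trans_type_demo$type == "VYBER"] <- "VYDAJ"
table(trans_type_demo$type)
```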
The amount column in the transaction data contains absolute transaction values; deposits (positive) and withdrawals (negative) are both recorded as positive. The absolute amount reflects the total financial flow and is useful for assessing a client's overall financial activity, but it cannot capture the transaction direction. So I will additionally introduce a signed amount variable ('+/-') to capture clients' saving and spending patterns. The sign is inferred from the balance (which increases/decreases after a deposit/withdrawal):
trans <- trans %>%
group_by(account_id) %>% arrange(account_id, trans_date) %>%
mutate(difference = balance - lag(balance, default = first(balance)),
`amount(+/-)` = ifelse(difference > 0, amount, amount * -1)) %>% select(-difference)
print(head(trans,3))
The relationship between account_id and trans_id is one-to-many.
Before joining all dataframes, I summarize the important relationships:
- account_id and district_id: many-to-one
- account_id and client_id: one-to-many
- account_id and trans_id: one-to-many
- account_id and order_id: one-to-many
- account_id and loan_id: one-to-one
- account_id and disp_id: one-to-many
- client_id and disp_id: one-to-one
- client_id and district_id: many-to-one
- card_id and disp_id: one-to-one
# First join the dataframes disposition, card, client, district, account, order.
df1 <- disp %>%
  left_join(card, by = "disp_id", suffix = c("_disp","_card")) %>% select(-c(card_id, disp_id)) %>%
  left_join(client, by = "client_id", suffix = c("_card","_client")) %>%
  left_join(district, by = "district_id") %>% select(-district_id) %>%
  left_join(account, by = "account_id") %>% select(-district_id) %>%
  left_join(order, by = "account_id") %>%
  left_join(loan, by = 'account_id', suffix = c("_account","_loan"))
print(head(df1,3))
It is worth noting that only the owner of an account can issue permanent orders and ask for a loan. This means the loan dataframe applies only to 'OWNER' clients, so I will keep only the 'OWNER' rows and drop the 'DISPONENT' rows. To keep the information "how many clients does an account have", I will generate a new column 'n_clients' to represent it.
df1 <- df1 %>% group_by(account_id) %>% mutate(n_clients = n())
barplot(table(df1$n_clients),
names.arg = names(table(df1$n_clients)),
        xlab = "Number of Clients per Account", ylab = "Count", main = "Distribution: Number of Clients per Account")
The plot shows that around 3700 accounts have only one client (OWNER) and around 1800 accounts have two clients (1 OWNER and 1 DISPONENT).
# check if it is true that DISPONENT clients have no credit cards.
df1 %>% group_by(type) %>% summarize(count_card = sum(!is.na(card_type))) %>% print()
The result confirmed my assumption: DISPONENT clients have no credit cards; only OWNER clients can be issued credit cards. So I will drop all 'DISPONENT' rows. Afterwards the type column will contain only 'OWNER', so it is no longer helpful for further analysis and I will drop the whole column as well.
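That filtering step is not shown above; a minimal sketch on a toy stand-in for `df1` (the exact column name of the disposition type depends on the join suffixes):

```r
library(dplyr)

# toy stand-in for the joined table `df1`
df1_demo <- tibble(
  account_id = c(1, 1, 2),
  type       = c("OWNER", "DISPONENT", "OWNER"))

# keep only account owners, then drop the now-constant type column
df1_demo <- df1_demo %>% filter(type == "OWNER") %>% select(-type)
print(df1_demo)
```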
Age is often an important factor in financial capacity analysis. I would add two variables ‘card_age’ and ‘loan_age’ to represent the client age when the card was issued, and the loan was issued.
df1$card_age <- as.numeric(difftime(df1$card_issued, df1$birthday, units = "days") / 365)
df1$loan_age <- as.numeric(difftime(df1$loan_date, df1$birthday, units = "days") / 365)
print(head(df1,3))
This includes determining the purchase date and the rollup window, defined by a 1-month lag and 12 months of history before the credit card purchase.
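The window construction itself is not shown; a hedged sketch of what "purchase date plus 12 months of history" might look like for one reference date (toy data; the names months_difference, windowed, and the cutoffs -12..-1 follow the plot axes used later):

```r
library(dplyr)
library(lubridate)

# toy stand-in: one buyer whose card was issued on 1997-06-15
card_issued <- as.Date("1997-06-15")
trans_demo  <- tibble(
  trans_date = as.Date(c("1996-04-01", "1996-07-10", "1997-05-20", "1997-06-10")),
  amount     = c(100, 200, 300, 400))

# months_difference: calendar months relative to the issue month (issue month = 0)
windowed <- trans_demo %>%
  mutate(months_difference =
           (year(trans_date)  - year(card_issued))  * 12 +
           (month(trans_date) - month(card_issued))) %>%
  filter(months_difference >= -12, months_difference <= -1)  # 12-month history, issue month excluded
print(windowed)
```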
Identify the existing credit card buyers:
buyers <- subset(df1, !is.na(card_type)) %>% select(c(account_id, card_type, card_issued))
nonbuyers <- subset(df1, is.na(card_type)) %>% select(c(account_id, card_type, card_issued))
cat("Number of Buyers:", nrow(buyers), "\n")
Number of Buyers: 747
Number of Non-Buyers: 3753
Ratio of Non-Buyers to Buyers: 5
# compare counts of buyers and non-buyers
barplot(c(nrow(buyers), nrow(nonbuyers)), names.arg = c("Credit Card Buyers", "Credit Card Non-Buyers"), ylab = "Count")
ggplot(buyers, aes(x = card_issued)) +
  geom_density(fill = "blue", alpha = 0.7) + # density plot
  labs(x = "Issued Date", y = "Density", title = "Distribution of Credit Card Issued Date") +
  theme_minimal()
There was a steady increase in credit card purchases between 1994 and 1996. From 1996 onward the growth was rapid, reaching peak density between 1998 and 1999, the period of highest credit card issuance activity.
In the whole dataset, 747 clients have already bought credit cards and 3753 have not. The ratio of 5 means there are approximately five times as many non-buyers as buyers. The dataset is severely imbalanced; for the machine learning modeling, sampling methods (e.g. oversampling, undersampling) or evaluation metrics suited to imbalanced data (e.g. F1 score) must be considered.
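As one hedged option (the ROSE package is already loaded above), the class ratio could be rebalanced before model fitting; a sketch on toy data, with `p = 0.5` targeting roughly equal classes:

```r
library(ROSE)

# toy imbalanced data: roughly 5 non-buyers per buyer
set.seed(42)
demo <- data.frame(
  buyer = factor(c(rep("yes", 50), rep("no", 250))),
  x     = rnorm(300))

# combine over- and under-sampling to reach approximately 50/50 classes
balanced <- ovun.sample(buyer ~ ., data = demo, method = "both", p = 0.5, seed = 1)$data
table(balanced$buyer)
```

Whether to rebalance, and with which method, should be validated against metrics such as F1 or AUC rather than accuracy.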
We can see slow growth of the average monthly transaction amount over the first 11 months. In the last month it increased rapidly, from around 58000 to 117000. To check whether this tendency is driven by outliers or a skewed distribution, I will also examine the median and the upper and lower quartiles.
# define a function to calculate and plot statistics
calculate_and_plot_stats <- function(df, col, y_label, title, a, b) {
  # calculate mean, median, upper quartile, and lower quartile per month
  stats <- df %>%
    group_by(months_difference) %>%
    summarize(
      mean_amount = mean(!!sym(col), na.rm = TRUE),
      median_amount = median(!!sym(col), na.rm = TRUE),
      upper_quartile_amount = quantile(!!sym(col), probs = 0.75, na.rm = TRUE),
      lower_quartile_amount = quantile(!!sym(col), probs = 0.25, na.rm = TRUE))
  # create a data frame for the line labels (y values taken at month -2)
  label_df <- data.frame(
    months_difference = -4,
    label = c("Mean", "Median", "Upper Quartile", "Lower Quartile"),
    y = c(
      stats$mean_amount[stats$months_difference == -2][1],
      stats$median_amount[stats$months_difference == -2][1],
      stats$upper_quartile_amount[stats$months_difference == -2][1],
      stats$lower_quartile_amount[stats$months_difference == -2][1]))
  # create a line plot
  ggplot(stats, aes(x = months_difference)) +
    geom_line(aes(y = mean_amount), color = "black", linetype = "solid") +
    geom_line(aes(y = median_amount), color = "blue", linetype = "solid") +
    geom_line(aes(y = upper_quartile_amount), color = "red", linetype = "solid") +
    geom_line(aes(y = lower_quartile_amount), color = "green", linetype = "solid") +
    scale_x_continuous(breaks = -12:-1) +
    labs(x = "Month", y = y_label, title = title) +
    geom_text(data = label_df, aes(label = label, x = months_difference, y = y),
              size = 3.5, vjust = -0.5, hjust = 0.5,
              color = c("black", "blue", "red", "green")) +
    theme_minimal() +
    scale_y_continuous(limits = c(a, b)) +
    theme(
      axis.title = element_text(size = 10),  # adjust the axis label text size
      plot.title = element_text(size = 12, hjust = 0.5))
}
plot1 = calculate_and_plot_stats(buyers_trans_hist, "roll_monthly_trans", "Monthly Transaction", "Credit Card Buyers", 8000, 85000)
plot1
The median, upper-quartile, and lower-quartile lines show a tendency very similar to the mean transaction amount, but the median line is consistently below the mean line. This means the data are right (positively) skewed. In this case it is more reasonable to use the median to measure the central tendency.
plot2 = calculate_and_plot_stats(nonbuyers_trans_hist, "roll_monthly_trans", " ", "Credit Card Non-Buyers", 8000, 85000)
common_title <- ggdraw() + draw_label("12 Months Rollup Transaction History", fontface = 'bold', size = 14)
bottom_row <- plot_grid(plot1, plot2, ncol = 2)
plot_grid(common_title, bottom_row, nrow = 2, rel_heights = c(0.2, 1))
Analysis of non-buyers: all four statistics lines show a very similar tendency, with the median constantly below the mean, indicating the data are right (positively) skewed; the median is therefore more robust than the mean for representing the data. The non-buyers' transaction tendency over the last 12 months is relatively stable, with a peak at the seventh month. The amounts of the 12th and last months are clearly higher than in the other months.
Summary based on the comparison of the transaction rollup windows, buyers vs. non-buyers:
- The monthly transaction amount of buyers (median: around 40,000 - 95,000) is much larger than that of non-buyers (median: around 18,000 - 29,000).
- Buyers and non-buyers show different tendencies over the last 12 months before the credit card purchase. This means transaction-amount-related features may be important for classifying buyers and non-buyers; for the machine learning modeling I will include comprehensive client-specific statistical feature engineering on the transaction amounts.
- Both datasets are imbalanced; balancing methods or evaluation metrics for imbalanced data must be considered.
Based on the available data, an asset can be calculated as: asset = balance - remaining loan (if the account has a loan). Of course, I have to account for how much of the loan has already been repaid at the time of the asset calculation. The "UVER" k_symbol in the transaction data stands for a loan payment; with this we can calculate the remaining loan for each observation in the transaction data.
trans_loan <- trans %>% left_join(loan, by = "account_id", suffix = c("_trans","_loan"))
colnames(trans_loan)[4] <- "trans_amount"
# split the transaction loan data into two parts
# part 1: the account had loan from the bank; here asset = balance - remaining loan
trans_loan_1 <- trans_loan %>%
filter(trans_date >= loan_date) %>%
arrange(account_id, trans_date) %>%
group_by(account_id) %>% mutate(
paid_loan = ifelse(grepl("UVER", trans_detail, ignore.case = TRUE), trans_amount, 0), # amount of paid loan in each transaction
accum_paid_loan = cumsum(paid_loan), # accumulated paid loan
remaining_loan = loan_amount - accum_paid_loan, # remaining loan
remaining_loan = ifelse(remaining_loan < 0, 0, remaining_loan), # replace negative remaining loan as 0, this means loan is totally paid.
asset = balance - remaining_loan) %>% ungroup() %>% select(-c(loan_id,paid_loan,accum_paid_loan,remaining_loan)) # asset
# part 2: the account which has never been issued a loan, and the account before loan is issued.
trans_loan_2 <- trans_loan %>% filter(trans_date < loan_date | is.na(loan_date)) %>% mutate(asset = balance) %>% select(-loan_id)
trans_loan <- rbind(trans_loan_1, trans_loan_2)
print(head(trans_loan,5))
Now I will join the asset data to the transaction data of buyers and non-buyers.
buyers_trans <- buyers_trans %>% left_join(trans_loan[, c("trans_id", "asset")], by = 'trans_id')
nonbuyers_trans <- nonbuyers_trans %>% left_join(trans_loan[, c("trans_id", "asset")], by = 'trans_id')
print(head(buyers_trans,5))
Construct the assets using a 12-month rollup window. After each transaction the asset of an account is updated, so each account contains many asset records. It therefore makes more sense to use the average asset within a month as the monthly asset.
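That monthly averaging step is not shown above; a hedged sketch on toy data (the name roll_monthly_asset matches the column used in the plot calls, the rest is mine):

```r
library(dplyr)

# toy stand-in: per-transaction asset records for one account
asset_demo <- tibble(
  account_id        = 1,
  months_difference = c(-2, -2, -1, -1, -1),
  asset             = c(1000, 1200, 900, 1100, 1300))

# average the asset within each month to get one value per account and month
asset_hist <- asset_demo %>%
  group_by(account_id, months_difference) %>%
  summarise(roll_monthly_asset = mean(asset), .groups = "drop")
print(asset_hist)
```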
plot3 = calculate_and_plot_stats(buyers_asset_hist, "roll_monthly_asset", "Monthly Asset", "Credit Card Buyers",2000,12000)
plot4 = calculate_and_plot_stats(nonbuyers_asset_hist, "roll_monthly_asset", " ", "Credit Card Non-Buyers",2000,12000)
common_title <- ggdraw() + draw_label("12 Months Rollup Asset History", fontface='bold', size = 14)
bottom_row <- plot_grid(plot3, plot4, ncol = 2)
plot_grid(common_title, bottom_row, nrow = 2, rel_heights = c(0.2, 1))
Comparing the asset rollup history of buyers and non-buyers:
- Buyers have significantly higher assets (median: around 8300) than non-buyers (median: 4900 - 5500).
- Buyers' assets increase slightly over the 12 months; in contrast, non-buyers' assets decrease over time.
- This means the asset statistics (slope, mean, median, quartiles) could be important features for distinguishing buyers from non-buyers.
Statistical feature engineering
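A sketch of the per-account trend features (slope and robust trend), assuming `x` is one account's monthly rolling series in time order; raising `maxit` in `MASS::rlm` is one way to address the convergence warnings seen during `summarize()`.

```r
# Sketch: trend features for one account's monthly series (x in time order).
# MASS::rlm is called with a namespace prefix to avoid masking dplyr::select().
trend_features <- function(x) {
  t <- seq_along(x)
  c(slope        = unname(coef(lm(x ~ t))[2]),                     # OLS slope
    robust_trend = unname(coef(MASS::rlm(x ~ t, maxit = 100))[2])) # robust slope
}

trend_features(c(100, 120, 115, 140, 160, 155, 170, 180, 175, 190, 200, 210))
```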
Warning: There were repeated warnings in `summarize()` (11, 59, 25, and 61 across the four calls).
The first of each batch was:
ℹ In argument: `robust_trend_trans = coef(rlm(roll_monthly_trans ~ seq_along(roll_monthly_trans)))[2]` (likewise for `robust_trend_asset`).
Caused by warning in `rlm.default()`:
! 'rlm' failed to converge in 20 steps
ℹ Run `dplyr::last_dplyr_warnings()` to see the remaining warnings.
Warning: ‘MASS’ namespace cannot be unloaded:
namespace ‘MASS’ is imported by ‘ipred’ so cannot be unloaded
# join the statistical features to the other data by account_id
df <- df1 %>% left_join(trans_stats, by = 'account_id') %>% left_join(asset_stats, by = 'account_id')
print(head(df,3))# check again that each account_id has only one row.
df_count <- df %>% group_by(account_id) %>% summarize(row_count = n())
if (any(df_count$row_count > 1)) {print("Some account_id values have more than one row.")} else {print("Each account_id has only one row.")}[1] "Each account_id has only one row."
Number of Rows: 4500
Number of Columns: 70
Total Number of NAs in the DataFrame: 42437
df <- df %>%
mutate(
loan_issuance = ifelse(is.na(loan_id), "No", "Yes"),
card_type = ifelse(is.na(card_type), "nonbuyers", "buyers"),
client_age = as.numeric(difftime('1999-01-01', birthday, units = "days")) / 365.0,
account_age = as.numeric(difftime('1999-01-01', account_date, units = "days")) / 365.0
) %>%
dplyr::select(-c(account_id, loan_id, loan_amount, loan_duration, loan_payments, loan_status, loan_date, loan_age, card_issued, card_age, birthday, account_date))Adding missing grouping variables: `account_id`
df <- df[, 2:ncol(df)]
# Round 'client_age' and 'account_age' to 1 decimal place
df$client_age <- round(df$client_age, 1)
df$account_age <- round(df$account_age, 1)
# Count NAs of incomplete columns
col_with_na <- df[, colSums(is.na(df)) != 0]
colSums(is.na(col_with_na)) order_SIPO order_UVER order_OTHER order_POJISTNE order_LEASING order_TOTAL
742 742 742 742 742 742
These 6 columns contain numeric data representing the accounts' order amounts. An NA indicates an order amount of 0, so I will replace the NAs with 0.
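A base-R sketch of that replacement (the helper name is illustrative):

```r
# Replace NAs with 0, but only in the order_* columns.
replace_order_na <- function(df) {
  ord_cols <- grep("^order_", names(df), value = TRUE)
  df[ord_cols][is.na(df[ord_cols])] <- 0
  df
}
# tidyverse equivalent:
# df %>% mutate(across(starts_with("order_"), ~ replace_na(.x, 0)))
```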
There are 0 NAs in data df.
Number of Rows: 4500
Number of Columns: 61
Data size is moderate.
[1] "client_id" "card_type"
[3] "gender" "district_name"
[5] "region" "nr_inhabitants"
[7] "nr_municipalities_1" "nr_municipalities_2"
[9] "nr_municipalities_3" "nr_municipalities_4"
[11] "nr_cities" "ratio_urban_inhabitants"
[13] "average_salary" "unemploy_rate_95"
[15] "unemploy_rate_96" "nr_enterpreneurs_per_1000_inhabitants"
[17] "nr_commited_crimes_95" "nr_commited_crimes_96"
[19] "account_freq" "order_SIPO"
[21] "order_UVER" "order_OTHER"
[23] "order_POJISTNE" "order_LEASING"
[25] "order_TOTAL" "n_clients"
[27] "mean_trans" "median_trans"
[29] "upper_quartile_trans" "lower_quartile_trans"
[31] "min_trans" "max_trans"
[33] "var_trans" "std_trans"
[35] "n_pos_changes_trans" "n_neg_changes_trans"
[37] "mad_trans" "sad_trans"
[39] "n_above_mean_trans" "n_below_mean_trans"
[41] "slope_trans" "robust_trend_trans"
[43] "mean_asset" "median_asset"
[45] "upper_quartile_asset" "lower_quartile_asset"
[47] "min_asset" "max_asset"
[49] "var_asset" "std_asset"
[51] "n_pos_changes_asset" "n_neg_changes_asset"
[53] "mad_asset" "sad_asset"
[55] "n_above_mean_asset" "n_below_mean_asset"
[57] "slope_asset" "robust_trend_asset"
[59] "loan_issuance" "client_age"
[61] "account_age"
In summary, the available variables can be categorized as follows: - Information about the account owner: gender, birthday, age, residential district - Information about the account: start date, account age, number of clients under the account_id - Detailed information about the residential district - Permanent order amounts by order type - Transactions: 12-month-rollup statistical variables - Assets: 12-month-rollup statistical variables
I will split the dataset into three parts to prevent data leakage: - Training set: used to train the predictive model. - Validation set: used for model fine-tuning and hyperparameter selection. - Test set: used to evaluate the final performance. To reduce dependence on how the data are split, I will use 10-fold cross-validation in the modeling part to enhance the robustness of the model.
set.seed(267)
df$card_type <- factor(df$card_type, levels = c("nonbuyers", "buyers"))
# create stratified training and temp sets
index_train <- createDataPartition(df$card_type, p = 0.8, list = FALSE)
# keep a copy of test set with client_id for later Top-N clients.
# drop client_id for both train and testset, as it is not relevant to modeling.
df_train <- df[index_train, ] %>% select(-client_id)
df_test_clientid <- df[-index_train, ]
df_test <- df_test_clientid %>% select(-client_id)I will use a logistic regression model as a baseline. Logistic regression requires numeric input variables, so I will first transform the non-numeric variables into numeric format.
unique_data_types <- unique(sapply(df, class))
cat("Data types of df:", paste(unique_data_types, collapse = ", "))Data types of df: integer, factor, character, numeric
cat_col <- colnames(df %>% select_if(is.character))
cat("Categorical variables and number of unique categories:\n")Categorical variables and number of unique categories:
gender district_name region account_freq loan_issuance
2 77 8 3 2
Categorical variables must be encoded for logistic regression. I will convert the binary features (card_type, gender, loan_issuance) directly to integers 1 and 0. For the multi-class features (district_name, region, account_freq), I will apply one-hot encoding to avoid assigning ordinal meaning to the encoded values.
There are 77 unique categories of district_name, so one-hot encoding that feature alone generates 77 columns. In total, we will get more than 140 features. Depending on the machine learning model chosen, this might not be a problem: tree-based models like decision trees and random forests can limit tree depth to restrict the number of variables used, and logistic regression can use regularization (e.g. L1) to automatically select relevant features and shrink the others to 0.
df_encoded <- df %>%
mutate(gender = ifelse(gender == "male", 1, 0),
loan_issuance = ifelse(loan_issuance == "Yes", 1, 0))
df_encoded$card_type <- factor(df_encoded$card_type, levels = c("nonbuyers", "buyers"))
#encode the categorical variables
encoded_df <- data.frame(model.matrix(~ 0 + ., data = df_encoded[, c("district_name", "region", "account_freq")]))
# remove categorical variables
df_ <- df_encoded[, -which(names(df_encoded) %in% c("district_name", "region", "account_freq"))]
# full data with encoded categorical variables
df_encoded <- cbind(df_, encoded_df)
df_encodedAfter one-hot encoding, the data includes 143 independent features.
# drop client_id in both train and test set.
df_train_encoded <- df_encoded[index_train,] %>% select(-client_id)
df_test_encoded <- df_encoded[-index_train,] %>% select(-client_id)
df_test_encoded$card_type <- factor(df_test_encoded$card_type, levels = levels(df_train_encoded$card_type))
x_train_encoded <- df_train_encoded %>% select(-(card_type))
x_test_encoded <- df_test_encoded %>% select(-(card_type))The train and test sets are split using the same data indices as the original partitioning, so the results remain comparable with later models.
As previously analyzed, the dataset is imbalanced. Using it directly for classification may bias the model toward the majority class, causing poor predictions for the minority class.
total_samples <- length(df_train$card_type)
n_buyers = length(df_train$card_type[df_train$card_type == 'buyers'])
n_nonbuyers = total_samples - n_buyers
weight_buyers <- total_samples / (2 * n_buyers)
weight_nonbuyers <- total_samples/ (2 * n_nonbuyers)
cat("There are ",n_buyers,"buyers,", n_nonbuyers, "nonbuyers\n")There are 598 buyers, 3003 nonbuyers
Weights of buyers: 3.01087
Weights of non-buyers: 0.5995671
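The later `train()` call passes a `weights` argument, but the per-observation weight vector is not constructed in the text above. A hedged sketch of how it could be built from the class counts (this is an assumption, not code shown in the notebook):

```r
# Sketch: inverse-frequency per-observation weights (assumed construction of
# the class_weights vector used later by caret::train()).
make_class_weights <- function(labels) {
  n     <- length(labels)
  n_buy <- sum(labels == "buyers")
  w_buy <- n / (2 * n_buy)         # e.g. 3601 / (2 * 598)  ~ 3.01
  w_non <- n / (2 * (n - n_buy))   # e.g. 3601 / (2 * 3003) ~ 0.60
  ifelse(labels == "buyers", w_buy, w_non)
}
# class_weights <- make_class_weights(df_train$card_type)
```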
ctrl <- trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE)
lr_model <- train(x= x_train_encoded,
y = df_train_encoded$card_type,
method = "glm", trControl = ctrl, metric = "ROC",
maxit = 10000, weights = class_weights)
Warning (repeated across the CV folds): non-integer #successes in a binomial glm!; glm.fit: fitted probabilities numerically 0 or 1 occurred; prediction from a rank-deficient fit may be misleading
The warnings suggest that some variables are highly correlated with others and may contribute to multicollinearity, so I will identify these variables and drop them.
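The reduction step that produces the `*_reduced` data frames is not shown above; here is a base-R sketch of one common approach, dropping one feature from each highly correlated pair (`caret::findCorrelation` implements a similar heuristic; the cutoff is illustrative).

```r
# Sketch: drop one column from each pair of numeric features whose absolute
# correlation exceeds the cutoff.
drop_high_cor <- function(x, cutoff = 0.9) {
  cm <- abs(cor(x))
  cm[upper.tri(cm, diag = TRUE)] <- 0            # consider each pair once
  drop <- names(which(apply(cm, 2, max) > cutoff))
  x[, setdiff(names(x), drop), drop = FALSE]
}
```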
Scaling is a common preprocessing step in machine learning; for logistic regression in particular, it ensures that all features are on the same scale and contribute comparably to the model's performance.
# Define the preprocessing method (scaling to [0, 1])
preprocess_method <- preProcess(df_train_encoded_reduced, method = c("range"))
# Apply the preprocessing method to both the training and test sets
df_train_encoded_reduced_scaled <- predict(preprocess_method, df_train_encoded_reduced)
df_test_encoded_reduced_scaled <- predict(preprocess_method, df_test_encoded_reduced)
x_train_encoded_reduced_scaled <- df_train_encoded_reduced_scaled[,-which(names(df_train_encoded_reduced_scaled) == "card_type")]
x_test_encoded_reduced_scaled <- df_test_encoded_reduced_scaled[,-which(names(df_test_encoded_reduced_scaled) == "card_type")]
metrics_table <- function(pred_labels, true_labels, idx, return_conf_matrix = FALSE){
expected_levels <- c("nonbuyers", "buyers")
pred_labels <- factor(pred_labels, levels = expected_levels)
true_labels <- factor(true_labels, levels = expected_levels)
conf_matrix <- confusionMatrix(pred_labels, true_labels, positive = 'buyers')
true_int <- ifelse(true_labels == 'nonbuyers', 0, 1)
pred_int <- ifelse(pred_labels == 'nonbuyers', 0, 1)
TP <- conf_matrix$table['buyers','buyers'] # True Positives
TN <- conf_matrix$table['nonbuyers','nonbuyers'] # True Negatives
FP <- conf_matrix$table['buyers','nonbuyers'] # False Positives
FN <- conf_matrix$table['nonbuyers','buyers'] # False Negatives
roc_obj <- roc(true_int, pred_int)
AuC <- auc(roc_obj)
result <- data.frame(
model_index = idx,
f1_score = conf_matrix$byClass["F1"],
accuracy = conf_matrix$overall["Accuracy"],
AuC = AuC,
kohenkappa = conf_matrix$overall["Kappa"],
precision = conf_matrix$byClass["Pos Pred Value"],
recall = conf_matrix$byClass["Sensitivity"])
if(return_conf_matrix){
return(as.data.frame(conf_matrix$table))
}
else{
return(result)}}
Warning (repeated, from glmnet C++ code, error codes -46 to -53): convergence for some lambda values (the 46th to 53th) not reached after maxit=1000 iterations; solutions for larger lambdas returned
With grid search, I obtain evaluation results for 30 models. The table is sorted by test-set F1 score, AUC, accuracy, and Cohen's kappa. The eighth model performs best.
To better visualize the result, I will present the table above as line plots. Line plots usually present trends; here, however, the lines simply connect the metrics belonging to the same model, which makes it easier to distinguish models and compare them across metrics.
metric_order <- c("f1_score_train","f1_score_test","AuC_train","AuC_test","accuracy_train","accuracy_test", "kohenkappa_train", "kohenkappa_test", "precision_train", "precision_test","recall_train","recall_test")
eval_long <- lr_results %>%
pivot_longer(cols = all_of(metric_order), names_to = 'metrics', values_to = 'values')
ggplot(eval_long, aes(x = factor(metrics, levels = metric_order), y = values, group = model_index_train)) +
geom_line(aes(color = model_index_train)) +
geom_point(aes(color = model_index_train)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ggtitle("Logistic regression: Evaluation with 10-fold Cross Validation")Most of the lines in the plot have very similar values, especially on the training set, except one model with significantly higher recall but lower accuracy, AUC, and Cohen's kappa.
Based on the table and line plot, I will focus on the top 6 models with the largest metric values.
Find the corresponding hyperparameters of the TOP 6 best models.
Select the first one as the best logistic regression model and explain it.
final_lr_model <- lr_models_list[[1]]
# explain the model
explainer_lr <- explain(final_lr_model, data = df_train_encoded_reduced_scaled, y=as.numeric(df_train_encoded_reduced_scaled$card_type), label='logistic regression model')Preparation of a new explainer is initiated
-> model label : logistic regression model
-> data : 3601 rows 19 cols
-> target variable : 3601 values
-> predict function : yhat.train will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package caret , ver. 6.0.94 , task classification ( default )
-> predicted values : numerical, min = 0.3164876 , mean = 0.490862 , max = 0.8129036
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.2094304 , mean = 0.675203 , max = 1.670787
A new explainer has been created!
Top 10 features by permutation importance ranking:
set.seed(123)
pfi_lr <- model_parts(explainer_lr, type = "variable_importance", N = 1000)
plot(pfi_lr[1:11,], show_boxplots = TRUE) + ggtitle("Logistic regression: Top 10 most important features")The plot shows how much the model's performance degrades when the values of one feature are permuted; larger drops indicate higher feature importance. For example, order_UVER has the highest importance here: the model with the original features has an AUC of 0.565, and permuting order_UVER reduces it to 0.539, a difference of 0.026. Only the top 5 features (order_UVER, order_SIPO, order_OTHER, nr_municipalities_4, nr_enterpreneurs_per_1000_inhabitants) contribute meaningfully to model performance; features 6 to 10 have almost no influence.
Let’s observe the top 10 clients with the highest prediction probabilities.
final_lr_predictions_test <- lr_predictions_test_list[[idx[1]]]
Top_n_clients_lr <- final_lr_predictions_test %>% arrange(desc(buyers))
Top_n_clients_lrOverall, the prediction probabilities are not very high, with a maximum of about 0.7.
Check exactly which features contributed to the predictions.
n = 10
plot_list <- list()
for (i in seq(1, n)) {
idx = Top_n_clients_lr[i,3]
client_id = Top_n_clients_lr[i,2]
bdp_lr <- predict_parts(explainer_lr, new_observation = df_test_encoded_reduced_scaled[idx, ])
p <- plot(bdp_lr) + ggtitle(paste("Logistic regression - Client_id ",client_id))
plot_list[[i]] <- p}
for (i in 1:n) {
print(plot_list[[i]])
}In all top 10 clients, the features order_SIPO, order_UVER, order_OTHER are the main contributors.
Receiver Operating Characteristic curve (ROC)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
Please use `reframe()` instead.
When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust accordingly.
The x-axis FPR of ROC represents false positive rate, and the y-axis TPR for true positive rate.
The AUC measures the area underneath the entire ROC curve from (0,0) to (1,1). It is an aggregate measure of the model’s performance across all classification thresholds. An AUC of 0.5 suggests no discriminative power and an AUC of 1.0 represents perfect prediction.
In my ROC graph, the AUC is 0.53, which suggests that the logistic regression model is performing only slightly better than random guessing for the given task.
The confusion matrix shows that over half of the buyers (81 of 149) are falsely predicted as non-buyers. Predictions for non-buyers are somewhat better: 257 of 750 non-buyers are falsely predicted as buyers.
The best logistic regression model performs weakly, suggesting that the relationship between the log odds of the dependent variable (buyer or non-buyer) and the independent variables may not be linear.
Decision trees can handle categorical and unscaled data, so I will use the original, non-encoded data with the same data splitting indices.
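The tree-training code itself is not shown; as a minimal sketch with `rpart`, using the built-in `iris` data as a stand-in for `df_train` and illustrative control values:

```r
library(rpart)

# Sketch: classification tree on raw, unencoded features; maxdepth and cp
# are illustrative, not the notebook's tuned values.
fit  <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(maxdepth = 4, cp = 0.01))
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)  # training accuracy
```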
Attaching package: ‘rpart’
The following object is masked from ‘package:dials’:
prune
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Visualize the results in line plots, where each line represents one model.
eval_long <- tree_results %>%
pivot_longer(cols = all_of(metric_order), names_to = 'metrics', values_to = 'values')
ggplot(eval_long, aes(x = factor(metrics, levels = metric_order), y = values, group = model_index_train)) +
geom_line(aes(color = model_index_train)) +
geom_point(aes(color = model_index_train)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ggtitle("Decision tree: Evaluation with 10-fold Cross Validation")
The lines have very similar shapes. The gap between the training and test metrics is relatively large, because tree-based models are prone to overfitting.
Check the corresponding hyperparameters of the top six models.
Select the model with index 19 as the best tree model, and explain it.
final_tree_model <- tree_models_list[[1]]
# explainer
explainer_tree <- explain(final_tree_model, data = df_train, y = as.numeric(df_train$card_type), label = 'tree model')
Preparation of a new explainer is initiated
-> model label : tree model
-> data : 3601 rows 60 cols
-> data : tibble converted into a data.frame
-> target variable : 3601 values
-> predict function : yhat.rpart will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package rpart , ver. 4.1.19 , task classification ( default )
-> predicted values : numerical, min = 0.03350785 , mean = 0.166065 , max = 0.6025974
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.3974026 , mean = 1 , max = 1.966492
A new explainer has been created!
set.seed(123)
pfi_tree <- model_parts(explainer_tree, type = "variable_importance", N = 1000)
plot(pfi_tree[1:11,], show_boxplots = FALSE) + ggtitle("variable_importance", "")
All of the top-10 features have a visible influence on model performance. The sad_trans feature has a dominant influence compared to the others.
## Top-N clients
final_tree_predictions_test <- tree_predictions_test_list[[idx[1]]]
Top_n_clients_tree <- final_tree_predictions_test %>% arrange(desc(buyers))
Top_n_clients_tree
The top clients have very high predicted probabilities, some even equal to 1.
10 features that contributed most to each of the top-10 clients
n = 10
plot_list <- list()
for (i in seq(1, n)) {
idx = Top_n_clients_tree[i,3]
client_id = Top_n_clients_tree[i,2]
bdp_tree <- predict_parts(explainer_tree, new_observation = df_test[idx, ])
p <- plot(bdp_tree) + ggtitle(paste("Decision tree - Client_id ",client_id))
plot_list[[i]] <- p}
for (i in 1:n) {
print(plot_list[[i]])
}
For each of the top-10 clients, at most six features contribute to the prediction result.
Setting levels: control = buyers, case = nonbuyers
Setting direction: controls > cases
AUC = 0.85 is relatively high, which means the decision tree predicts much better than random guessing.
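The AUC interpretation above can be illustrated with a small base-R sketch (toy vectors, not the notebook's actual predictions): AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```r
# Toy illustration: AUC as the rank statistic over positive/negative pairs.
y    <- c(rep(0, 6), rep(1, 4))                       # 0 = nonbuyer, 1 = buyer
prob <- c(0.1, 0.2, 0.3, 0.35, 0.6, 0.4,              # nonbuyer scores
          0.7, 0.8, 0.55, 0.9)                        # buyer scores

auc_manual <- function(y, prob) {
  pos <- prob[y == 1]
  neg <- prob[y == 0]
  # mean over all positive/negative pairs; ties count half
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
auc_manual(y, prob)
```

With perfect separation this value is 1 and with random scores it is about 0.5, which is why 0.85 indicates a clearly better-than-random model.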
The confusion matrix indicates that the model predicts the nonbuyers group better than the buyers group (weak recall on buyers).
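The recall weakness can be made concrete with a toy confusion matrix (made-up counts, not the notebook's actual results): overall accuracy can look good even when most buyers are missed.

```r
# Toy confusion matrix (rows = predicted, cols = actual).
cm <- matrix(c(850, 60,    # predicted nonbuyer
                10, 40),   # predicted buyer
             nrow = 2, byrow = TRUE,
             dimnames = list(pred   = c("nonbuyer", "buyer"),
                             actual = c("nonbuyer", "buyer")))
recall_buyers    <- cm["buyer", "buyer"] / sum(cm[, "buyer"])   # TP / (TP + FN)
precision_buyers <- cm["buyer", "buyer"] / sum(cm["buyer", ])   # TP / (TP + FP)
accuracy         <- sum(diag(cm)) / sum(cm)
c(recall = recall_buyers, precision = precision_buyers, accuracy = accuracy)
```

Here accuracy is about 0.93 while buyer recall is only 0.40, which is exactly the pattern the confusion matrix above reveals for the minority class.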
The final decision tree model performs better than the baseline logistic regression model.
eval_long <- forest_results %>%
pivot_longer(cols = all_of(metric_order), names_to = 'metrics', values_to = 'values')
ggplot(eval_long, aes(x = factor(metrics, levels = metric_order), y = values, group = model_index_train)) +
geom_line(aes(color = model_index_train)) +
geom_point(aes(color = model_index_train)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ggtitle("Random forest: Evaluation with 10-fold Cross Validation")
We see two groups of lines in the plot. Compared to the decision tree, the gap between training and test metrics is smaller, because a random forest averages many decision trees and thus reduces overfitting.
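The variance-reduction effect of averaging can be shown with a toy simulation (assuming independent tree predictions; real forest trees are correlated, so the practical reduction is smaller, but the direction is the same).

```r
# Toy simulation: the spread of an average of B independent noisy predictions
# shrinks by roughly sqrt(B) relative to a single prediction.
set.seed(42)
B <- 100; n_rep <- 2000
single   <- rnorm(n_rep, mean = 0.3, sd = 0.2)                       # one "tree"
ensemble <- replicate(n_rep, mean(rnorm(B, mean = 0.3, sd = 0.2)))   # averaged "forest"
c(sd_single = sd(single), sd_ensemble = sd(ensemble))
```

Both estimators are unbiased around 0.3, but the ensemble's standard deviation is about ten times smaller, which is why the train/test gap narrows for the forest.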
Check the exact hyperparameters of the top six random forest models.
idx <- forest_results[1:6,'model_index_train']
top_6_parameters <- hyper_grid[idx,]
top_6_parameters
Best random forest model
final_forest_model <- forest_models_list[[1]]
# explainer
explainer_forest <- explain(final_forest_model, data = df_train, y = as.numeric(df_train$card_type), label = 'random forest model')
Preparation of a new explainer is initiated
-> model label : random forest model
-> data : 3601 rows 60 cols
-> data : tibble converted into a data.frame
-> target variable : 3601 values
-> predict function : yhat.train will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package caret , ver. 6.0.94 , task classification ( default )
-> predicted values : numerical, min = 0 , mean = 0.1038767 , max = 0.82
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.62 , mean = 1.062188 , max = 2
A new explainer has been created!
set.seed(123)
pfi_forest <- model_parts(explainer_forest, type = "variable_importance", N = 1000)
plot(pfi_forest[1:11,], show_boxplots = FALSE) + ggtitle("variable_importance", "")
The top-10 features are mainly account_age and summary statistics of transactions and assets. The most important feature, account_age, has an influence of only about 0.0008, which is very small compared to the previous two models. This suggests the random forest draws on a long list of candidate features with similar importance, whereas the decision tree and logistic regression relied on only a few features.
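The permutation importance behind `model_parts` can be sketched in a few lines of base R (toy data and a plain `glm`, not the notebook's model): a feature's importance is the increase in loss after shuffling that feature's column.

```r
# Minimal permutation-importance sketch on synthetic data.
set.seed(123)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- ifelse(df$x1 + 0.1 * df$x2 + rnorm(n, sd = 0.5) > 0, 1, 0)

fit  <- glm(y ~ x1 + x2, data = df, family = binomial)
loss <- function(d) mean((d$y - predict(fit, d, type = "response"))^2)  # Brier score
base_loss <- loss(df)

perm_importance <- sapply(c("x1", "x2"), function(v) {
  d <- df
  d[[v]] <- sample(d[[v]])       # break the feature/target link
  loss(d) - base_loss            # dropout loss increase
})
perm_importance                  # x1 should matter far more than x2
```

This is the same idea DALEX uses: features whose shuffling barely changes the loss, like `x2` here, get small importance values.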
final_forest_predictions_test <- forest_predictions_test_list[[idx[1]]]
Top_n_clients_forest <- final_forest_predictions_test %>% arrange(desc(buyers))
Top_n_clients_forest
The predicted probabilities are relatively high, but lower than those of the decision tree.
Features contribution analysis for top-10 clients
n = 10
plot_list <- list()
for (i in seq(1, n)) {
idx = Top_n_clients_forest[i,3]
client_id = Top_n_clients_forest[i,2]
bdp_forest <- predict_parts(explainer_forest, new_observation = df_test[idx, ])
p <- plot(bdp_forest) + ggtitle(paste("Random forest - Client_id ",client_id))
plot_list[[i]] <- p}
for (i in 1:n) {
print(plot_list[[i]])
}
For each client, the top-10 contributing features have very similar contribution strengths, and the remaining features together also contribute substantially. This shows that a long list of features drives the final prediction for each client.
AUC = 0.91 indicates a good prediction.
Confusion matrix
The confusion matrix shows that the random forest predicts the nonbuyers group almost perfectly, but over half of the buyers are still misclassified as nonbuyers. Even with class weights, the random forest does not perform well on the minority class.
Across all six metrics on the test set, the random forest performs much better than the decision tree and logistic regression.
Attaching package: ‘e1071’
The following object is masked from ‘package:tune’:
tune
The following object is masked from ‘package:rsample’:
permutations
The following object is masked from ‘package:parsnip’:
tune
Visualize the evaluation results of the tuning runs.
eval_long <- svm_results %>%
pivot_longer(cols = all_of(metric_order), names_to = 'metrics', values_to = 'values')
ggplot(eval_long, aes(x = factor(metrics, levels = metric_order), y = values, group = model_index_train)) +
geom_line(aes(color = model_index_train)) +
geom_point(aes(color = model_index_train)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ggtitle("SVM: Evaluation with 10-fold Cross Validation")
A few models have higher metric values on the test set than on the training set, which is a sign of underfitting. One model has markedly higher values on all metrics and shows no sign of underfitting; this should be the best model, but I will still check the hyperparameters of the top six models.
Find the corresponding hyperparameters of the top six models.
Select the best SVM model and explain it.
final_svm_model <- svm_models_list[[1]]
# explainer
explainer_svm <- explain(final_svm_model, data = df_train, y = as.numeric(df_train$card_type), label = 'svm model')
Preparation of a new explainer is initiated
-> model label : svm model
-> data : 3601 rows 60 cols
-> data : tibble converted into a data.frame
-> target variable : 3601 values
-> predict function : yhat.svm will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package e1071 , ver. 1.7.13 , task classification ( default )
-> predicted values : numerical, min = 0.0001697634 , mean = 0.1663878 , max = 0.9999999
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.0016064 , mean = 0.9996772 , max = 1.993597
A new explainer has been created!
set.seed(123)
pfi_svm <- model_parts(explainer_svm, type = "variable_importance", N = 1000)
plot(pfi_svm[1:11,], show_boxplots = FALSE) + ggtitle("SVM: Top 10 important features")
The top-10 important features in the SVM model are statistics of transactions and assets. The differences in importance among them are mild, but all are significant. The top feature, max_trans, influences model performance by almost 0.1, roughly three times as much as the tenth feature, var_asset.
final_svm_predictions_test <- svm_predictions_test_list[[idx[1]]]
Top_n_clients_svm <- final_svm_predictions_test %>% arrange(desc(buyers))
Top_n_clients_svm
The predicted probabilities of the top clients are very high and similar, almost perfect.
The ROC curve also indicates a good prediction.
Plot confusion matrix
The SVM predicts the buyers group much better than the other three models, which suggests that the SVM handles the imbalanced data better.
order <- c("f1_score","AuC", "accuracy","kohenkappa","precision","recall")
metric_results$AuC <- as.numeric(metric_results$AuC)
eval_long <- metric_results %>%
pivot_longer(cols = all_of(order), names_to = 'metrics', values_to = 'values')
ggplot(eval_long, aes(x = factor(metrics, levels = order), y = values, group = model_index)) +
geom_line(aes(color = model_index)) +
geom_point(aes(color = model_index)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ggtitle("Comparison of four models")
Comparing the evaluation metrics of the four models:
- Performance ranking: SVM > random forest > decision tree >> logistic regression.
- The SVM (Support Vector Machine) model performs best: it has the highest F1 score, accuracy, AuC, Cohen's kappa, and recall, and the second-highest precision. This is possibly due to the SVM's strengths: it is effective in high-dimensional spaces, less prone to overfitting, and can handle imbalanced data.
- The random forest also performs very well, only slightly worse than the SVM.
- The decision tree performs worse than the random forest. A single decision tree is prone to overfitting: even though it performs well on the training set, it usually generalizes poorly to the test set. Random forests reduce variance by averaging multiple deep decision trees, each trained on a different bootstrap sample of the same training set, which yields a more robust model that generalizes better to new data.
- Logistic regression performs much worse than the other three models. A likely reason is that logistic regression models the data with a linear decision boundary, while in this dataset the relationship between the dependent and independent variables is not linear and involves more complex interactions.
end_index_5 <- as.integer(nrow(df_test)*0.05)
end_index_10 <- as.integer(nrow(df_test)*0.1)
Top_5_clients_lr <- Top_n_clients_lr[1:end_index_5,]$client_id
Top_10_clients_lr <- Top_n_clients_lr[1:end_index_10,]$client_id
Top_5_clients_tree <- Top_n_clients_tree[1:end_index_5,]$client_id
Top_10_clients_tree <- Top_n_clients_tree[1:end_index_10,]$client_id
Top_5_clients_forest <- Top_n_clients_forest[1:end_index_5,]$client_id
Top_10_clients_forest <- Top_n_clients_forest[1:end_index_10,]$client_id
Top_5_clients_svm <- Top_n_clients_svm[1:end_index_5,]$client_id
Top_10_clients_svm<- Top_n_clients_svm[1:end_index_10,]$client_id
top5_lists<- c("Top_5_clients_lr","Top_5_clients_tree","Top_5_clients_forest","Top_5_clients_svm")
top10_lists<- c("Top_10_clients_lr","Top_10_clients_tree","Top_10_clients_forest","Top_10_clients_svm")
Attaching package: ‘reshape2’
The following object is masked from ‘package:tidyr’:
smiths
Based on the performance and the overlap of top clients, I choose the SVM model as the final best model:
- It has the highest F1-score, accuracy, AuC, and recall, and the second-highest precision, on unseen test data.
- Its predicted top clients overlap strongly with those of the random forest and decision tree models, which suggests that its criteria for identifying top clients are reliable.
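The overlap check can be sketched as a Jaccard similarity between two models' top-client lists (toy ID vectors here; the notebook's `Top_5_clients_*` vectors would slot in the same way):

```r
# Jaccard overlap of two top-client ID lists: |A ∩ B| / |A ∪ B|.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

top5_svm    <- c(101, 102, 103, 104, 105)   # toy client IDs
top5_forest <- c(103, 104, 105, 106, 107)
jaccard(top5_svm, top5_forest)
```

A value near 1 means the two models largely agree on who the top prospects are, which is the reliability signal used above.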
set.seed(123)
pfi_svm <- model_parts(explainer_svm, type = "variable_importance", N = 1000)
plot(pfi_svm[1:11,], show_boxplots = FALSE) + ggtitle("SVM: Top 10 important features")
max_trans appears to be the most important feature, followed by sad_trans, sad_asset, slope_trans, and upper_quartile_trans as the top five. Transactions, assets, and account age appear to be very important for classifying buyers and nonbuyers in this case.
Gradually reduce the number of features based on their importance. Firstly, get the feature names sorted by importance in a vector.
df_pfi_svm <- as.data.frame(pfi_svm) %>%
group_by(variable) %>%
summarise(mean_dropout_loss = mean(dropout_loss)) %>% arrange(mean_dropout_loss)%>% slice(2:60)
sorted_variables <- df_pfi_svm$variable
I will start by including only the top-10 important features for training and predicting the card types, then add three more features at each step.
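The stepwise scheme can be sketched as building growing feature subsets from the importance-sorted names (`sorted_variables` below is a short stand-in vector, and `k0`/`step` are illustrative parameters, not the notebook's actual objects):

```r
# Build feature subsets of increasing size from an importance-sorted vector.
sorted_variables <- c("max_trans", "sad_trans", "slope_trans", "mad_trans",
                      "var_asset", "sad_asset", "account_age", "min_asset")
k0 <- 2; step <- 3   # starting size and features added per step (toy values)
feature_steps <- lapply(seq(k0, length(sorted_variables), by = step),
                        function(k) sorted_variables[seq_len(k)])
sapply(feature_steps, length)   # feature-set size at each step
```

Each subset would then be used to retrain the SVM and evaluate on the test set, producing one point per step in the plot below.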
svm_results_test_$AuC <- as.numeric(svm_results_test_$AuC)
df_long <- pivot_longer(svm_results_test_, cols = -model_index, names_to = "variable", values_to = "value")
# Create the line plot
ggplot(df_long, aes(x = model_index, y = value, color = variable)) +
geom_line() +
theme_minimal() +
labs(x = "Step (5 features per step)", y = "metric values", title = "Stepwise feature selection")
At step 2, the model reaches its highest AuC, F1-score, Cohen's kappa, and recall; after that, the metric values decrease. With the features selected at step 2, the SVM performs best on the test set: with fewer features (step 1) the model likely underfits, and with more features (steps 3 to 10) it overfits and therefore generalizes worse on new data.
[1] "card_type" "max_trans" "n_above_mean_trans" "sad_trans"
[5] "n_above_mean_asset" "slope_trans" "mad_trans" "n_pos_changes_asset"
[9] "lower_quartile_asset" "n_neg_changes_trans" "var_asset" "sad_asset"
[13] "n_below_mean_asset" "n_below_mean_trans" "mad_asset" "account_age"
svm_2 <- rbind(svm_results_test_[1,],metric_results[1,])
svm_2$model_index <- c("16 selected features","all features")
svm_2
eval_long <- svm_2 %>%
pivot_longer(cols = -model_index, names_to = 'metrics', values_to = 'values')
ggplot(eval_long, aes(x = factor(metrics), y = values, group = model_index)) +
geom_line(aes(color = model_index)) +
geom_point(aes(color = model_index)) +
theme(axis.text.x = element_text(hjust = 1)) +
xlab("") +
ggtitle("Compare the effect of feature selection")
Stepwise feature selection has greatly reduced overfitting and improved generalization: performance on the unseen dataset is much higher than with the model that uses all features.
Finally, the top 5% of clients with the highest potential:
Break down the profiles of the top 10 clients to understand the feature contributions in each case.
final_model <- svm_models_list_[[2]]
df_train_final <- df_train[selected_features]
df_test_final <- df_test[selected_features]
# explainer
explainer_final <- explain(final_model, data = df_train_final, y = as.numeric(df_train_final$card_type), label = 'svm model')
Preparation of a new explainer is initiated
-> model label : svm model
-> data : 3601 rows 16 cols
-> data : tibble converted into a data.frame
-> target variable : 3601 values
-> predict function : yhat.svm will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package e1071 , ver. 1.7.13 , task classification ( default )
-> predicted values : numerical, min = 0.0002017855 , mean = 0.1657384 , max = 0.9999885
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.0954057 , mean = 1.000327 , max = 1.982359
A new explainer has been created!
n = 10
plot_list <- list()
for (i in seq(1, n)) {
idx = predictions_test[i,3]
client_id = predictions_test[i,2]
bdp_final <- predict_parts(explainer_final, new_observation = df_test_final[idx, ])
p <- plot(bdp_final) + ggtitle(paste("Break down profile - Client_id ",client_id))
plot_list[[i]] <- p}
for (i in 1:n) {
print(plot_list[[i]])
}
set.seed(123)
pfi_final <- model_parts(explainer_final, type = "variable_importance", N = 1000)
plot(pfi_final, show_boxplots = FALSE) + ggtitle("Final SVM model: permutation importance")
After analyzing the 12-month rollup information, my final model identified the following factors as the most important for identifying potential credit card buyers. These will help bank staff sell credit cards, as well as other bank products, more efficiently.
Transaction values and distribution:
- Maximal transaction: the largest single transaction a client has made. A high maximum might suggest a customer who makes large investments or significant purchases and could be interested in a credit card or other products for high-value transactions.
- Sum of absolute differences: reflects how much a client's transactions go up and down. A high value means the transaction amounts change a lot, which could indicate irregular income or expenses and a need for flexible banking solutions.
- Slope: tells whether a client's transactions are generally increasing or decreasing over time. An increasing slope could mean the customer's financial activity is growing, making them a good candidate for savings and investment products.
- Upper quartile: gives an idea of the client's upper-range transaction amounts. Customers with a high upper quartile might be more likely to buy premium banking products.
- Median absolute deviation: measures how varied a customer's transactions are. A high value may indicate a financial life with many ups and downs.
- Number of transactions above the mean: a client who often has transactions above their average might be experiencing growth or have variable income.
- Median: a good indicator of what a typical transaction looks like, which helps in understanding a client's usual financial behavior.
- Variance: measures how much a customer's transactions differ from each other. High variance means a lot of variability, possibly indicating unpredictable financial needs.
Asset:
- Sum of absolute differences: as with transactions, a high value suggests the client's account balance changes frequently, indicating a dynamic financial situation.
- Robust trend: indicates whether the client's assets are consistently increasing, decreasing, or staying about the same over time. A positive trend signals financial growth; the client might be open to investment opportunities.
- Number of positive asset changes: more positive changes might indicate a stable or growing financial situation, suggesting a good candidate for additional financial products.
- Minimal value: the smallest balance the client has had in their account. It can help in understanding the customer's financial lows and in offering products that help with low balances.
- Standard deviation: tells how much a client's account balance varies. A client with a high standard deviation may need help managing financial risk.
- Number of asset values above the mean: how often a customer's account balance is above their average. A large number suggests the client might be a good prospect for savings or investment products, given their habit of maintaining higher balances.
Account age: how long the client has had this account with the bank. A longer account age could indicate loyalty and familiarity with the bank's services, and these clients might be more receptive to new offers. For bank staff, these features provide clues about a customer's financial health and behavior; understanding them can help in suggesting the right products, such as savings accounts, investment opportunities, financial planning services, or special programs for clients with high transaction volumes.